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experiments are given to show the efficiency of the new algorithm on CPU 
and GPU for the solution of large sparse matrix equation. 
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1 Introduction 


The solution of linear systems of equations is at the heart of many compu- 
tations in science, engineering, and other disciplines; see [2,8-10] and their 
references. Hence, many researches have been performed on various types of 
matrix equations; for example, see [1,8, 11, 16,18, 19,27, 28]. 

The principal goal of this paper is to use support graph preconditioning 
techniques to solve the matrix equation 


AXB=C, (1) 


where A € R”*” and B € R™*"™ are large sparse Stieltjes matrices. The 
linear matrix equation (1) can be written as the following nm x nm linear 
system: 

(BT @ A)vec(X) = vec(C), (2) 


where vec(X) is the vector of R”” obtained by stacking the columns of the 
nx m matrix X and ® denotes the Kronecker product (A ® B = [a;;B]j;). 
The CG algorithm [24] can be used to solve the linear system (2). However, 
for large problems, this approach cannot be applied directly. In addition, the 
number of iterations of conjugate gradient method for the solution of linear 
system of equations Ax = b is bounded by the square root of the spectral 
condition number «(A) of A. The condition number is the ratio of extreme 
eigenvalues of A, K(A) = Amax(A)/Amin(A). Preconditioner accelerates the 
convergence of iterative methods for solving linear systems. 

In this paper, we use the preconditioned global conjugate gradient (PGL- 
CG) method for obtaining the approximate solution of matrix equation (1). 
The preconditioner is based on the support graph preconditioners. Predeces- 
sors of support-graph methods can be found in the work from the late 80s by 
Notay, Beauwens, and collaborators in which graph-theoretic notions (princi- 
pally paths) are used in the analysis of preconditioners; see [3,4,21—23]. These 
insights were extended by Vaydia [26], who described his work in a talk in 
1991 but did not publish a paper. Vaidya used support-graph techniques to 
design a family of preconditioners based on spanning trees in graphs. Later, 
Gremban, Miller, and Zagha [14,15] extended the technique and used it to 
construct another family of preconditioners. In Section 3, we use Vaidya’s 
maximum spanning tree preconditioners of matrices A and B for developing 
fast and efficient preconditioner to precondition equation (2). 

Throughout this paper, all matrices are assumed to be real. For two 
matrices X,Y € R"™S, the inner product (X,Y) = (Y7X) is used and the 
associated norm is the Frobenius norm denoted by ||.||r. 


The rest of the paper is organized as follows. In the next section, we 
implement the preconditioned global CG method for solving matrix equation 
(1) and we introduce Vaidya’s maximum spanning tree preconditioner. In 
section 3, we present a new algorithm for computing the inverse of this kind 


A new approximate inverse preconditioner based on ... 3 


of preconditioners. In section 4, numerical examples are given to illustrate 
the efficiency of the proposed preconditioner. Conclusions are summarized 
in Section 5. 


2 Preconditioned GL-CG method for solving the matrix 
equation AX B=C 


In this section, we consider the matrix equation AX B = C, where A and B 
are symmetric and positive definite and assume that the preconditioners P, 
and Pp are available. The preconditioners P4 and Pg are the matrices that 
approximate A and B in some sense, respectively. It is assumed that P, and 
Pp are also symmetric positive definite. Then, we can precondition system 
(1) as follows: 


(Pg ® P4)~'(B @ A)vec(X) = (Pg @ Pa) ‘vec(C), (3) 


where the preconditioner (Pp ® P4) is a symmetric positive definite matrix. 
In addition, from the fact that ||A @ B|| = ||A]|||B]| [17], we have 


cond((Pg ® P4)~'(B ® A)) = cond(P3'B)cond(P,' A). 


The straightforward application of PCG algorithm [24] to the linear system 
(3) yields the following preconditioned global CG algorithm for solving the 
matrix equation (1). 


Algorithm 1 PGL-CG for solving AXB=C 
1. Compute Ro = C — AXoB, Zp = Py RoPyz’ and Py = Zo 


2. for 7 =0,1,..., until convergence do 
3  _ _<Rj,Zj>Rr 
- OF = <AP;B,Pj>6 


4 Xj41 = Xj + aj,P; 
5 Ry41 = R; — aj; APB 
6. Zp = Py ReiPs 
7. 8; = <Rj41,Zj41>F 
8 
9. 


<Rj,4j>Pr 
» Pip = Zi41 + By Pj 
end for 


We focus on applying Vaidya’s preconditioner of the first class to the ma- 
trices A and B for constructing the preconditioners P4 and Pg. In order 
to explain Vaidya’s preconditioner, we first present the following definition 
from [7]. 


Definition 1. The underlying graph G4 = (Va, Fa) of an n-by-n symmetric 
matrix A is a weighted undirected graph whose vertex set is V4 = {1,2,...,n} 
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and whose edges set is Ey = {(i,j):7 4 J,ai,7 # O}. The weight of an edge 
(i,j) is —a;,;. The weight of a vertex 7 is the sum of elements in the row i of 
A. 

Graph preconditioner, introduced by Vaidya [26] in the early nineties, 
uses maximum-weight spanning tree (MWST) preconditioners to bound the 
condition number of a preconditioned system. Vaidya’s method constructs a 
preconditioner M whose underlying graph G'y), is a subgraph of G4 (graph of 
A). The graph Gy of preconditioner has the same set of vertices as G4 and 
a subset of the edges of G4. Vaidya proposed two classes of precondition- 
ers. The first class of MWST preconditioners guarantees a condition number 
bound of O(n?) for any n x n sparse diagonally dominant symmetric (SDD) 
matrix; see [7]. The second class of preconditioners is based on MWST aug- 
mented with a few extra edges. This class of preconditioners guarantees that 
the work in the linear solver is bounded by O(n!:"°) for any sparse diagonally 
dominant matrix. In this paper, we focus on applying Vaidya’s precondition- 
ers of the first class to a subclass of SDD matrices, the class of SDD matrices 
with nonpositive off-diagonal elements (Stieltjes matrices). 

In order to construct the MWST preconditioner P, for A, we first con- 
struct the maximum-weight spanning tree T,4 in G4 and then modify the 
diagonal elements of preconditioner P,4 such that A and Py, have the same 
row sums. In other words, T,4 is a connected graph with no cycles (i.e., a 
spanning tree), and the total weight of its edges is maximal among all span- 
ning trees of G4. The preconditioner P, is a diagonally dominant Stieltjes 
matrix whose underlying graph is Gp, = T4, and whose row sums are iden- 
tical to those of A. 

When the condition number of the matrix B® A is high, it becomes nec- 
essary to develop a fast and efficient preconditioner for the iterative solution 
of (2). In order to precondition the system (2), we first construct Vaidya’s 
preconditioners (maximum-weight spanning tree) P4 and Pg for A and B, 
respectively, and then we use Pg ® P, as a preconditioner for the matrix 
B@®A. The implementation of this preconditioner is based on computation 
of the inverse matrices re and Pr 1 In Section 3, we show that, by using 
the breadth first search (BFS) algorithm [25], we can easily compute these 
inverse matrices. 


3 Computation of inverse of a MWST preconditioner 


Let M be a symmetric positive definite matrix whose underlying graph Tj, is 
a tree. In order to compute the inverse of M, we need the following definition. 


Definition 2. The elimination matrix L,,(—a) € R"*”" with p ¥ q, is 
an identity matrix with one nonzero off-diagonal entry in the row p and the 
column q. Therefore the entries of Lp,(—q) are as follows: 
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1 if 7 =, 
(Lpq(—@)) (4,3) = —a_ if (i, 9) = (p,q); 
0 otherwise. 


Now we investigate the result of symmetric transformation 
M = Lyq(—a)M LF, (-a). (4) 


Now Lpq(—a)M changes only the row p of M, while M LT (-a) changes only 
the column p. Thus, multiplying out equation (4) and using the symmetry 
of M, we get the explicit formulas 


Mpj = Nip = Mpj — oa i FD, 
Mpp = Mpp — ZAMpg + A Mgq, (5) 
Miz = M54 = Miz otherwise. 


The idea of our method is to try to zero the off-diagonal elements of MW 
by a series of transformations (4) and using the leaves of graph Tyy and its 
subtrees. 


Let us assume that the vertex q is a leaf in Ty, and its neighbor is the 
vertex p. In order to zero the off-diagonal m,,, accordingly, to set Mp, = 0, 
equation (5) gives the following expression for the parameter a: 


a=—. (6) 
From (5) and (6), the entries of M = Lyq(—a)M LJ, (—a) are as follows: 


see 2 
Mpp = Mpp — 2ZAMpq + A" Maq, 

Mpg = Mqp = O, 

Mii = Miz otherwise. 


This process which eliminates the nonzero entries Mpg and mg, of M, 
is equivalent to eliminate the edge (p,q) from the tree Ty. If M denotes 


the matrix obtained from the matrix M by removing the row q and the 
column q, then it is trivial that the underlying graph of this submatrix is 
the induced subtree of Tyy on V(Tyr) — {gq}. Let My = Mypi = pia = 
g,a1, = a, My = Lyq(—a)MLyq(—a@)", and Mz = M; then, we can succes- 
sively transform M to diagonal form by means of transformations of the 
type (4) in (n — 1) steps with the elimination matrices Ly,,q,(—a;) and 
aj = (Mp,,q;/Mq;,q;),3 = 1,2,...,n —1, which are defined by choosing the 
edges (p;,9;),j = 1,2,...,n —1 such that the vertex q; is a leaf in the 
subtree Tyy;,. To achieve this, we need to apply the BFS algorithm (Al- 
gorithm 2) to the maximum-weight spanning tree Ty, to obtain the vector 
V = [ji,j2,---;Jn], which represents an array of vertices that are traversed 
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and sorted by the BFS algorithm and Level(u;),j = 1,2,...,n, which rep- 
resent the level of traversed vertices u;,7 = 1,2,...,n in the BFS tree. 

By using the array of vertices V = [j1, j2,---,jn], we can diagonalize the 
matrix M in n—1 steps. In step k, k = 1,2,...,n—1, by choosing the vertex 
dk = Jn—k+1 from V and considering its parent py = in—~+1 and the edge 
(pk, dx), we define the elimination matrix Lp, 4, (—ax) for eliminating the off- 
diagonal element mp, .q,- In Lemma 1, we show that the vertex qx = jn—k+1 
at step k is a leaf in the subtree Tyy,. In what follows, we show that by using 
Level (u;), j = 1,2,...,n, we can reduce the overall time of producing the 
inverse of M. 


Algorithm 2 Breadth first search algorithm as BFS(G,s) 


V=6 
for each vertex u € V(G) — s do 
state(u) = ” undiscovered” 
p(u) = nil, i.e. no parent is in the BFS tree 
end for 
state(s) =” discovered” 
V=VU{s} 
Level(s) =0 
p(s) = nil 
Q = {s} 


while Q 4 @ do 
u = dequeue(Q) 
process vertex u as desired 
for each v € Adj(u) do 
process edge (u,v) as desired 


if state(v) =” undiscovered” then 
state(v) = ” discovered” 
V=VU{v} 
pv) =u 


Level(v) = Level(p(v)) +1 
enqueue(Q, v) 
end if 
state(u) = ” processed” 
end for 
end while 


Lemma 1. Let V = {j1, j2,..-, jn} be the set of vertices obtained by the BFS 
algorithm. If we isolate the vertex jn and Thin) denotes the induced subtree 
of Ty on the verter set V(Tm) — {jn}, then jn—1 is a leaf in the induced 
subtree Te), 


Proof. Let i, be the parent of 7, and Level(j,) = 1. According to the BFS 
algorithm, if s < t, then Level(j,) < Level(j,). Suppose that we isolate the 
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vertex jn; then we must consider the Level(jn_1). If Level(jn—1) = 1, then 
it is trivial that the vertex j,_1 is a leaf in the induced subtree ro. If 
Level(jn—1) = 1—1, then it means that there is no vertex in level 1, so the 
vertex jn—1 has no children, according to the BFS algorithm, and it is a leaf 
in the subtree qk, 


Let M be Vaidya’s maximum-weight spanning tree preconditioner for the 
diagonally dominant spd matrix A and let V = {j1, j2,...,jn} be the array 
obtained by the BFS algorithm. We observe that we can diagonalize M by 
n — 1 elimination matrices Ly, 4,(—ax), k = 1,2,...,n—1, where q, = 
jn—k+1- 50, the elimination process yields the diagonal matrix D as follows: 


De ly ale oF Mi. git 


n—-1> 


where Lj, = Lp, q,(—Qk),k = 1,2,...,2—1. Therefore, we have 
Mo = (D4 Tig clay Dg ahi oe Ty) 
In addition, by supposing that level(j,,) = 1, we can write 


M~! = (G,Gq...G,)7D71(GiGe...G)), 


where Gy = Ly,45,-1---Lv, for k = 1,...,1, and sy, represents the num- 
ber of vertices that have the level k in the graph Ty,, and the matrices 
Dy,+s,—-1)---, Ly, are generated by the vertices jn_(1,49,—-1)41) +++ In—vet1) 


which have level k. In Lemma 2, we show that the nonzero off-diagonal 
elements of Gz, are equal to the nonzero off-diagonal elements of matrices 


Ly ts,—15 ae) Ly,- 


Lemma 2. Let Sp = {Jn—(y,+s,-1)+1)-++>Jn—- +1} be the set of vertices 
in level k of algorithm BFS and let Gy = Ly ,45,-1---Lv,, where Ly = 
Ly,.,q.(—Qr) for rT = Ve,--+,Ve+s,—-1 and pr is the parent of dr = jn—r41- 


Then the entries of Gy are as follows: 
1 ifi=J, 
—a, = —Teee if (i, 9) = (Pr, Qr), Gr = Jn—-r+1 © Sk, 


and (p;, gr) € E(Ta), 
0 otherwise. 


Proof. From the definition of elimination matrix L, = Ly, q,(—a,), we have 
Ly = Lp,,q,(—Or) = I — Or Ep,,ays 


where Ey,.q, contains only Os except for 1 in the (p,,q,)th position. From 
the fact that all the vertices in S; have level k, for all q-(= jn—r41), Ge’ (= 
Jn—r'+1) © Sp, we have 
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Ep,.,ar x Ey sq =0 for dr # Pr’. 


So, by the induction on the number of elimination vertices in level k, we can 


easily show that 
Vetsp—-1 


Gr =J- S- ApEn, aps 


Tr=V_E 


which completes the proof. 


Finally, summarizing the previous results, we describe the tree inverse 
algorithm for computing the inverse of M as follows: 


Algorithm 3 Tree inverse 


input(T, root) 
(uv, Level) = BF'S(T4, root) 
Mat 
d = Diag(A) 
for k = 1 to 1 step -1 (J is the number of levels obtained from the BFS 
algorithm) do 
G=I 
for all vertices in level k do 
j =current vertex 
i =the parent of current vertex 


Mig 


4 Numerical experiments 


In this section, we compare the experimental results obtained by solving the 
preconditioned system of equation (1). Four preconditioners will be com- 
pared: MWST, AINV (right-looking version) [5], incomplete Cholesky, and 
RIF [6] preconditioners. In addition, the following approaches are used for 
applying the MWST preconditioners: 


1. We use the matrices Pye and Pz ' computed by the tree inverse algo- 
rithm (Algorithm 3). 
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2. We use the Cholesky factorization of the MWST preconditioners P,4 
and Pg for computing Z;,; in lines 1 and 6 of Algorithm 1. 


3. In order to reduce the fill-in, first, we apply the reverse Cuthill-McKee 
ordering [12,13] to the preconditioners P4 and Pg and then we use the 
Cholesky factorizations of the resulting matrices. 


Finally, for large matrices, we compare the results obtained by the approach 1 
on GPU and CPU. 

The examples have been coded in MATLAB with double-precision and 
have been executed on a quad-processor 4.2 GHz i7 computer with 32 GBytes 
of main memory. In all examples, the initial iteration matrix is zero. We stop 
the iterations when 

RError = Fell e 


= €, 
I|Rolle ~ 


where R;, is the residual of the kth iterate and € is a proper stopping tolerance. 
In all the tables, the CPU time is in second and a dagger ({) indicates that 
no convergence is achieved after 10000 iterations except for Tables 5 and 6, 
where the maximum number of iterations is 30000. We also set the stopping 
tolerance 10~°. The matrix C is chosen such that the exact solution X has 
the entries x;; =i*j fori=1,...,n andj =1,...,m. 

For the first set of examples, we consider the matrix NOS6 from Harwell— 
Boeing collection [20] and the matrix ST,,, which is obtained by discretizing 
the poisson equation 


Ou Ou 


ie? oye in Q = ]0,1[ x JO, 1[ 


with the Dirichlet boundary condition on a uniform grid of mesh size h = ait 
via central differences. These matrices with their properties are presented in 
Table 1. In this table, cond denotes the condition number of the matrices in 


2-norm. 


Table 1: First set of test problems information 


Test matrix on nnz cond 
STs 25 105 20.77 
STi0 100 460 69.8634 
ST20 400 1920 258.4520 
ST30 900 4380 564.9227 
ST40 1600 7840 989.2690 
STs0 2500 12300 1531.5 
Nos6 675 3255 8x 10% 


In Table 2, we compare the number of iterations (It) and the CPU itera- 
tion time (It-time) for the preconditioners: the approximate inverse with drop 
tolerance equal to 0.1 (AINV) [5] , the incomplete Cholesky factorization 
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Table 2: Number of iterations and CPU times to converge for the first set of examples 


atrices MWST, MWST» MWST3 AINV IC(0) RIF 

A B It It-time It It-time It It-time It It-time It It-time It It-time 

NOS6 STs 207 0.16 196 0.20 194 0.11 431 0.24 273 0.20 199 0.29 

NOS6 STio 499 0.97 485 1.86 486 1.63 1110) 2.08 575 1.71 405 1.59 

NOS6 = ST20 1131 16.78 1115 35.49 1115 = 20.93 2738 34.58 1353 24.61 962 21.27 

NOS6 = ST30 1817 84.99 1797 162.92 1797 ~=100.26 4936 235.54 2285 136.82 1591 105.34 

NOS6 = ST40 2489 308.60 2470 =512.05 2470 = 335.12 7771 = 978.51 3381 463.15 2396 385.51 

NOS6 — ST50 3167 884.18 3152 = 1283.7 3148 = 873.3 t t 4570 1242.9 3332 1094.4 

NOS6 NOS6~ 725 22.90 694 35.25 694 23.12 2016 =38.51 1737 62.65 766 35.05 

Table 3: Preconditioning times and total times for the first set of examples 

Matrices MWST, MW ST>2 MW ST3 AINV IC(0) RIF 
A B P-time T-time P-time T-time P-time T-time P-time T-time P-time T-time P-time  T-time 
NOS6 STs 0.45 0.61 4.61 4.81 0.85 0.96 0.40 0.64 1.08 1.28 0.27 0.56 
NOS6 STio0 0.49 1.46 4.70 6.56 0.87 2.50 0.40 2.43 1.11 2.82 0.29 1.82 
NOS6 = ST20 0.65 17.43 7.25 42.74 1.10 22.03 0.48 35.06 1.44 26.05 0.39 21.66 
NOS6 = ST30 0.92 85.91 23.91 188.83 2.10 102.36 1.16 236.7 2.97 139.79 0.89 106.23 
NOS6 = ST40 1.34 309.94 85.63 598.13 4.88 340.00 4.89 983.49 6.92 470.07 3.53 389.04 
NOS6 = S750 2.15 886.33 248.50 1532.20 10.35 883.65 17.02 t 15.25 1258.15 10.84 1105.24 
NOS6 NOS6 _— 0.80 23.70 9.19 44.44 1.63 24.75 0.78 59.29 2.14 64.79 0.56 35.61 
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(IC(0)), MWST using the approaches 1-3 (MWST,, MWST>2, MWSTs, 
respectively), and the robust incomplete factorization with drop tolerance 
equal to 0.1 (RIF). The CPU time for computing the preconditioner (P- 
time) and the total time for computing an approximate solution (T-time) are 
given in Table 3. Table 2 reveals that the preconditioner MW ST; is faster 
(in terms It-time) than the other preconditioners (except for MW ST3 with 
A =NOS6 and B = STs, STs9) and it requires a lower number of iterations 
than AINV and IC(0) preconditioners. Table 3 shows that the precondi- 
tioners MW ST, is faster (in terms T-time) than the other preconditioners 
(except for MWST3 with A =nos6 and B = STs, and RIF for NOS6 and 
STs). In addition, for large matrices (NOS6 with STyo, and ST509), MWST, 
preconditioner is better (in terms of P-time) than the other preconditioners. 
For small matrices (NOS6 with STs, ST19, ST29, and S739), we observe that 
the time of constructing the preconditioner RIF is smaller than that of the 
other preconditioners. For the second set of examples, we define matrices 
STM, = ST, + DI,, where DI, is a diagonal matrix such that the matrix 
STM,, has zero row weights (except for one row, where we increase the row 
sums to obtain a nonsingular matrix). Table 4 represents the properties of 
these matrices. 


Table 4: Second set of test problems information 


Test matrix n nnz cond 

STMs 25 105 2.0002 x 10 
STMio 100 460 8.0012 x 106 
STMa0 400 1920 3.2006 x 107 
STM30 900 4380 7.2016 x 107 
STM,4o 1600 7840 1.2803 x 108 
STMs0 2500 12300 2.0005 x 108 


The results obtained for these matrices (which have large condition num- 
ber) are presented in Tables 5 and 6. The results of Table 5 show that 
MWST, is the best in terms of iteration time (except for MWST3 with 
A= STM;, B = STMsz9 and RIF with A = STM2), B = ST Mo). From 
the results of Table 6, we observe that, for large matrices, MW ST; is bet- 
ter (in terms of total time) than the other preconditioners (except for RIF 
with A = ST Mo, B= ST Myo and A = ST Mo, B= ST Mo). Finally, 
we consider the results obtained for the preconditioner MW ST, in terms of 
CPU time, GPU time, and the number of iterations. We mention that the 
preconditioner was computed on the CPU. All numerical experiments in this 
section were computed in double precision with a MATLAB code. We used 
a Geforce GTX 1070 GPU with 8 GBytes VRAM memory. The results are 
listed in Tables 7 and 8. The notations CIT (GIT) and CIT-time (GIT-time) 
represent the number of iterations and CPU iteration time (GPU iteration 
time) on the CPU (GPU) required for convergence, respectively. These ta- 
bles show that the number of iterations for the CPU and the GPU are close 


K. Rezaei, F. Rahbarnia and F. Toutounian 


12 


Table 5: Number of iterations and CPU times to converge for the second set of examples 


Matrices MWST, MWST»2 MWST3 AINV IC(0) RIF 

A B It It-time It It-time It It-time It It-time It It-time It It-time 

ST Ms ST Ms 101 0.01 85 0.01 84 0.01 195 0.02 165 0.01 83 0.01 

ST Ms STMio 390 0.03 360 0.06 357 0.05 1135 0.11 669 0.10 447 0.08 

ST Ms ST M20 = 932 1.06 882 1.52 881 1.09 4745 5.87 2152 3.31 1585 = 2.28 

ST Ms ST M30 1491 8.86 1437 9.67 1435 8.50 i" t 4299 25.19 3327 924.73 

STMio STMi9 831 0.18 768 0.49 762 0.29 084 0.27 662 0.31 431 0.18 

STMio STM29 2264 = 4.72 2187 = 8.63 2186 5.66 7074 =-14.41 3149 9.62 2322 ~=6.30 

STMio STM3z9 3596 334.00 3520 52.34 3514 34.80 i" tT 6589 66.48 5141 60.66 

STM20 STM20 4568 36.58 4420 89.29 4419 44.67 9185 69.22 3721 41.84 2667 29.07 

STM20 STM39 7599 216.78 7517 461.86 7508 253.27 i" tT tT Tt 8309 = 333.13 

Table 6: Preconditioning times and total times for the second set of examples 

Matrices MWST, MW ST» MWST3 AINV IC(0) RIF 
A B P-time T-time P-time T-time P-time T-time P-time T-time P-time T-time P-time T-time 
STMs STM5 0.04 0.05 0.04 0.05 0.02 0.03 0.02 0.04 0.02 0.03 0.02 0.03 
STM; STMio 0.07 0.10 0.13 0.19 0.05 0.10 0.02 0.13 0.04 0.14 0.02 0.10 
STMs STMo20 (0.22 1.28 2.80 4.32 0.27 1.36 0.10 5.97 0.38 3.69 0.10 2.38 
STMs STM30 ~——0.47 9.33 20.29 29.96 1.55 10.05 0.82 tT 1.88 27.07 0.61 25.34 
STMigo STMio0_ 0.10 0.28 0.22 0.71 0.08 0.37 0.02 0.29 0.06 0.37 0.02 0.20 
STMio STM 0.25 4.97 2.89 11.52 0.30 5.96 0.10 14.51 0.40 10.02 0.10 6.4 
STMi9 STM30 0.50 34.50 20.38 72.72 1.58 36.38 0.82 t 1.90 68.38 0.61 61.27 
STM20 STM20_~ 0.40 36.98 5.56 94.85 0.52 45.19 0.18 70.30 0.74 42.58 0.19 29.26 
STM29 STM30 0.65 217.43 23.05 484.91 1.8 255.07 0.90 t 2.24 t 0.69 333.82 
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together and that the GPU time is very smaller than the CPU time for large 
matrices. So, we can conclude that MW ST, preconditioner in the GL-CG 
method offers a great potential in a parallel processing environment. 


Table 7: The performance of the preconditioner MW ST, on CPU and GPU 
for the first set of examples 


Matrices MWST, 

A B Cit Clit-time GIt GlIt-time 
NOS6 ST5 207 0.16 208 0.56 
NOS6 — STi0 499 0.97 500 1.39 
NOS6 ST20 1131 16.78 1131 11.95 
NOS6 = ST30 1817 84.99 1815 57.91 
NOS6 = ST4o 2489 308.60 2487 200.15 
NOS6 = ST50 3167 884.18 3162 591.95 
NOS6 NOS6~ 725 22.90 729 18.10 


Table 8: The performance of the preconditioner MW ST, on CPU and GPU 
for the second set of examples 


Matrices MWST, 

A B Cit Clt-time GIt GIt-time 
STMio STMio 831 0.18 833 1.09 

ST Mio ST M20 2264 4.72 2267 4.73 
STMio STM3q9 3596 34.00 3604 20.35 


STMio STM 4926 152.78 4931 71.82 
STMio STMs0 6274 488.41 6287 197.01 
STM20 STM29 4568 36.58 4578 24.91 
STM20 STM30 7599 216.78 7611 143.72 
STM29 STMao 10343 840.34 10354 520.89 
STM20 STMso 13010 2234.6 13027 =1500.6 
STM30 STMz9 11171 712.07 11199 = 491.93 
STM30 STMao 15489 2563.8 15504 )=—-:1731.5 
STM39 STMso 19497 6225.7 19521 4999.5 
STMao STMao 20170 6165.1 20207 4156.8 
STMao STMs59 25848 15348.00 25873 12075.00 
STMs0 STMs50 31525 30637.00 31573 24410.00 


5 Conclusion 


We have proposed an approach for computing an approximate solution of 
matrix equation AX B = C, where A and B are Stieltjes matrices. In this 
approach, by using the BFS algorithm, we presented a new algorithm for 
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obtaining the inverse of Vaidya’s maximum spanning tree preconditioner as 
an approximate inverse preconditioner. This preconditioner does not require 
solving any linear systems and is highly parallelizable. We observed that this 
algorithm furnishes an efficient preconditioner for the matrix equations. The 
numerical experiments showed that, for large matrices, this preconditioner is 
better than the other preconditioner in terms of iteration time and total time 
and the new algorithm is very efficient on the GPU. 
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